2017/12/18

Introduction

To find out what Internet users are talking about and how they feel about the California fire, this report analyzes data collected from Twitter a week after the fire, focusing mainly on:

1. The frequency of words mentioned by users, shown as word clouds.

2. Visualization of sentiment towards the fire across different hashtags and locations.

3. Statistical analysis to generalize to the population.

Data summary

Total tweets gathered: 7500, 2500 under each set. After gathering location data and restricting the scope to the US, this was reduced to 1288 observations for #Californiafire, 876 observations for "California fire" and 1256 observations for #Californiawildfires, 3420 observations in total.
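As a quick consistency check, the three per-set counts can be added up. The Python sketch below is illustrative only (the dictionary name is made up for the example); it simply sums the figures quoted above. The resulting total of 3420 also agrees with the 3418 residual degrees of freedom (n - 2) reported in the first regression output later in this report.

```python
# Observations left in each set after restricting locations to the US
# (figures copied from the data summary above).
us_counts = {
    "#Californiafire": 1288,
    '"California fire"': 876,
    "#Californiawildfires": 1256,
}

total = sum(us_counts.values())
print(total)  # 3420

# Share of the 2500 tweets gathered per set that survived the location filter.
for name, kept in us_counts.items():
    print(f"{name}: {kept / 2500:.1%} retained")
```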

Wordcloud for total data

Wordcloud for hashtag #Californiafire

The first wordcloud, for #Californiafire, is approximately neutral.

Wordcloud for key words "California fire"

Tweets under the key words "California fire" were more likely to be negative, involving words like illegal and criminal.

Wordcloud for hashtag #Californiawildfires

In contrast, tweets under this hashtag are more positive, with words like brave and bless.

Histogram of sentiment

The proportion of negative words is larger and their sentiment is stronger, which suggests people are more likely to complain about the fire than to pray for those affected.
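The report does not state how the sentiment score was computed. Integer scores between -4 and 5 are consistent with a simple lexicon count (positive words minus negative words), so the Python sketch below shows that approach as an assumption, not as the method actually used. The word lists and the function name are hypothetical and far smaller than a real opinion lexicon.

```python
import re

# Hypothetical mini-lexicons for illustration only; a real analysis would
# use a full opinion lexicon with thousands of entries.
POSITIVE = {"brave", "bless", "safe", "thank"}
NEGATIVE = {"illegal", "criminal", "destroyed", "tragic"}


def sentiment_score(text):
    """Return (number of positive words) - (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg


tweets = [
    "Bless the brave firefighters",
    "Illegal campfire, criminal negligence",
    "Road closed near the fire",
]
scores = [sentiment_score(t) for t in tweets]
print(scores)  # [2, -2, 0]

# Strength of sentiment regardless of direction (assumed to correspond to
# the absolute_score variable used in the regressions).
abs_scores = [abs(s) for s in scores]
print(abs_scores)  # [2, 2, 0]
```

Under this scheme a tweet with no lexicon words scores 0, which would explain a median of 0 in a sample dominated by short, factual tweets.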

Map with sentiment

Here red represents positive sentiment and blue represents negative sentiment. There seem to be more users located on the east coast than on the west coast. The color for tweets from the west coast is also darker, especially in Southern California, which suggests people there have stronger negative sentiment.

Mapping under #Californiafire

Mapping under "California fire"

Mapping under #Californiawildfires

Shiny Interactive Map

Since the sentiment score is already shown on the maps above, the interactive map focuses on the retweet count of each tweet as a measure of its popularity. In the map, the deeper the color of a point, the more popular the tweet.

You can zoom the map and click on any individual point to find out more about that tweet, such as its sentiment score, user name and retweet count. You can also explore whether retweet count is related to the user's location. The application is available at: https://sabrina414.shinyapps.io/InteractiveMap/

Statistical Analysis

Summary table

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.00000 -1.00000  0.00000 -0.06316  0.00000  5.00000

The summary of the score shows that the mean sentiment score is slightly negative (-0.063), with a minimum of -4 and a maximum of 5; the median is 0. The overall sentiment is therefore slightly more negative than positive, which agrees with the conclusion above.

Test of normality

The histogram shows that the data are approximately normally distributed.

Regression summary

## 
## Call:
## lm(formula = total$retweetCount ~ total$absolute_score)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -322.5  -129.3   -93.7   -53.0 10922.1 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             94.74      10.12   9.362  < 2e-16 ***
## total$absolute_score    46.16      10.37   4.452 8.79e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 462.9 on 3418 degrees of freedom
## Multiple R-squared:  0.005764,   Adjusted R-squared:  0.005474 
## F-statistic: 19.82 on 1 and 3418 DF,  p-value: 8.793e-06

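Conclusion: the absolute sentiment score is significantly associated with retweet count (p = 8.79e-06). The stronger the sentiment, the higher the retweet count: each additional point of absolute score corresponds to about 46 more retweets on average. However, R-squared is only about 0.006, so sentiment strength explains less than 1% of the variation in retweet count.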
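The figures in the regression output above are linked by simple arithmetic, which can be checked by hand. The Python sketch below is only an illustration (variable names are made up): it recomputes the t value, the F-statistic and R-squared from the reported estimate, standard error and residual degrees of freedom.

```python
# Values copied from the regression output above.
estimate, std_error = 46.16, 10.37
residual_df = 3418

# t value = estimate / standard error
t_value = estimate / std_error

# With a single predictor, the F-statistic equals the squared t value.
f_stat = t_value ** 2

# R-squared follows from F and the residual degrees of freedom.
r_squared = f_stat / (f_stat + residual_df)

print(round(t_value, 2))    # 4.45
print(round(f_stat, 1))     # 19.8
print(round(r_squared, 4))  # 0.0058
```

The recomputed values match the reported 4.452, 19.82 and 0.005764 up to rounding of the inputs.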

Smooth Line

The smooth line is consistent with the conclusion above. However, the dramatic trend may indicate a more specific relationship between retweet count and sentiment score, which needs further investigation.

Does location matter?

Neither latitude nor longitude is statistically significant in the regression summary below (p = 0.497 and p = 0.154), which means that the strength of sentiment is not clearly related to people's location.

## 
## Call:
## lm(formula = total$absolute_score ~ total$lat + total$lon)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6598 -0.6102 -0.5823  0.4008  4.4138 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.6464250  0.1315383   4.914 9.33e-07 ***
## total$lat   0.0018418  0.0027113   0.679    0.497    
## total$lon   0.0010699  0.0007512   1.424    0.154    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7634 on 3417 degrees of freedom
## Multiple R-squared:  0.0007787,  Adjusted R-squared:  0.0001938 
## F-statistic: 1.331 on 2 and 3417 DF,  p-value: 0.2642
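As an illustrative check on the output above, the overall F-statistic can be recovered from R-squared, the number of predictors and the residual degrees of freedom, and each t value from its estimate and standard error. The Python sketch below uses only numbers printed in the table; the variable names are made up for the example.

```python
# Values copied from the regression output above.
r_squared = 0.0007787
n_predictors = 2        # latitude and longitude
residual_df = 3417

# F = (R^2 / k) / ((1 - R^2) / residual df)
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)
print(round(f_stat, 2))  # 1.33

# t value = estimate / standard error, for each coordinate
t_lat = 0.0018418 / 0.0027113
t_lon = 0.0010699 / 0.0007512
print(round(t_lat, 2), round(t_lon, 2))  # 0.68 1.42
```

An F-statistic this close to 1 means the two coordinates together explain about as much variation as would be expected by chance.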

Conclusion

Conclusion:

1. The sentiment of the texts differs across keywords and hashtags. Overall, the data set is slightly more negative.

2. The popularity of a text, represented by its retweet count, is related to the strength of sentiment.

3. Location is not significantly related to the strength of sentiment.

Improvement:

There may be some duplicate texts across the three datasets, since the keywords and hashtags are quite similar. These duplicates should be removed.

Also, it would be better to overcome the limits of the Google geocode API and gather more data. A larger dataset would support this project with more factual evidence.
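The duplicate removal suggested above could be sketched as follows. This is a minimal Python illustration under stated assumptions: the record layout (a dict with "set" and "text" keys), the sample tweets and both function names are hypothetical, and treating a retweet as a duplicate of the original is a design choice that may not suit every analysis.

```python
import re


def normalize(text):
    """Lowercase, drop URLs and a leading 'RT @user:' marker, collapse spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"^rt @\w+:\s*", "", text)
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(tweets):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize(tweet["text"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


# Made-up records standing in for the three combined datasets.
combined = [
    {"set": "#Californiafire", "text": "Stay safe everyone https://t.co/abc"},
    {"set": "#Californiawildfires", "text": "RT @someone: Stay safe everyone"},
    {"set": '"California fire"', "text": "Evacuations ordered tonight"},
]
print(len(deduplicate(combined)))  # 2
```

Because the first occurrence is kept, the order in which the three sets are combined decides which hashtag a shared tweet is attributed to.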